Abstract

Constructing an optimal rooted phylogenetic network from an arbitrary set of rooted triplets is an NP-hard
problem. In this paper, we present a heuristic algorithm called TripNet, which tries to construct a rooted phylogenetic
network with the minimum number of reticulation nodes from an arbitrary set of rooted triplets. Unlike current
methods, which work only for dense sets of rooted triplets, TripNet is also applicable to non-dense sets of rooted
triplets; this is its key innovation. We prove some theorems to clarify the performance of the algorithm. To demonstrate
the efficiency of TripNet, we compared it with SIMPLISTIC, the only available software that can return a rooted
phylogenetic network consistent with a given dense set of rooted triplets. The results show that for complex networks
with high levels the running time of SIMPLISTIC increases abruptly, whereas in all cases TripNet outputs an appropriate
rooted phylogenetic network in an acceptable time. We also tested TripNet on the Yeast data. The results show that the
TripNet network and the optimal network have the same clustering, and that TripNet produced a level-3 network which
contains only one more reticulation node than the optimal network.
Introduction

Phylogenetic networks are a generalization of phylogenetic trees 
that permit the representation of non-tree-like underlying histories. 
A rooted phylogenetic network is a rooted directed acyclic graph 
in which no node has indegree greater than 2 and the outdegree of 
each node with indegree 2 is 1 . Such nodes are called reticulation 
nodes. In rooted phylogenetic networks the nodes with indegree 1 
and outdegree 0 are called leaves and are distinctly labeled by a set 
of given taxa. Mathematicians are interested in developing 
methods that infer a phylogenetic tree or network from basic 
building blocks. In the computation of a rooted tree or network, 
one group of the basic building blocks are rooted triplets, the 
rooted binary trees on three taxa [1]. 

In 1981, Aho et al. studied the problem of constructing a rooted
tree from a set of rooted triplets [2]. They proposed the BUILD
algorithm, which shows that, given a set of rooted triplets, it is
possible in polynomial time either to construct a rooted tree that
contains all the input triplets or to decide that no such tree exists.

When there is no tree for a given set of triplets, one may try to
produce an optimal phylogenetic network. In this context, the goal
is to compute an optimal rooted phylogenetic network that
contains all the rooted triplets. One possible optimality criterion is
to minimize the level of the network, which is defined as the
maximum number of reticulation nodes contained in any
biconnected component of the network. The other optimality
criterion is to minimize the number of reticulation nodes [1]. In
[3] and [4] the authors considered the problem of deciding
whether, given a set of rooted triplets as input, it is possible to
construct a level-1 rooted phylogenetic network that contains all
the input triplets. They showed that, in general, this problem is
NP-hard. However, in [4] the authors showed that when the set of
rooted triplets is dense, which means that for each set of three taxa
there is at least one rooted triplet in the input set, the problem can
be solved in polynomial time. Following these results, all research in
this new area has so far focused on constructing rooted
phylogenetic networks from dense rooted triplet sets.

LEVIATHAN is an algorithm for generating a level-1 rooted
phylogenetic network from a set of rooted triplets [5]. Specifically,
it attempts to find a level-1 rooted phylogenetic network that
contains as many of the input rooted triplets as possible. This
problem is NP-hard [5]. The algorithm of [6] can be
used to find a level-1 or a level-2 rooted phylogenetic network
which minimizes the number of reticulation nodes, if such a
network exists. In [6] the authors also showed that for a dense set
of rooted triplets $\tau$, if $\tau$ is precisely equal to the set of rooted
triplets that are contained in some rooted phylogenetic network, then
such a rooted phylogenetic network with the smallest possible level
can be constructed in time $O(|\tau|^{k+1})$, where $k$ is a fixed upper
bound on the level of the network. In addition, based on the ideas
described in [6], for a given dense set of rooted triplets $\tau$ the
authors proposed the SIMPLISTIC algorithm, which always returns some
rooted phylogenetic network that contains $\tau$ but gives no
minimality guarantees.

Figure 1. A triplet and a network consistent with it. (a) The triplet
$ij|k$, (b) The triplet $ij|k$ is consistent with the given network.
doi:10.1371/journal.pone.0106531.g001

In [7] the authors showed that, given a dense set of rooted
triplets $\tau$ and a fixed number $k$, it is possible in time
$O(|\tau|^{k+1})$ either to construct a level-$k$ rooted phylogenetic
network that contains $\tau$ or to decide that no such network exists.

In this paper we present a heuristic algorithm called TripNet for
constructing rooted phylogenetic networks with the minimum
number of reticulation nodes from an arbitrary set of rooted
triplets. Unlike current methods, which work only for dense sets of
rooted triplets, TripNet is also applicable to non-dense sets of
rooted triplets; this is its key innovation.

In "unpublished data" the authors applied TripNet on both real 
and simulated data. Here the TripNet algorithm is described in
detail, some theorems are proved, and one simulation is
performed to show the accuracy of TripNet. TripNet is also
tested on the Yeast data. This paper is organized as follows. In
section 2, some definitions and notation are presented first. Then
we describe the BUILD algorithm. Finally, a new method called TCD
is introduced for constructing rooted triplets from (biological)
sequences. In section 3 we compare TripNet with SIMPLISTIC
on the triplet sets that are obtained from the TCD method. Then we
test TripNet on the Yeast data. In section 4 we discuss the
performance of TripNet. In the last section the directed graph $G_\tau$
related to a set of triplets $\tau$ is introduced. Then we show that if
a set of triplets is either obtained from a set of sequences using the
TCD method or consistent with a tree, then $G_\tau$ is
a DAG. This property plays a key role in solving, in polynomial
time, the Integer Programming system introduced later.
Then the concept of the height function of a
rooted phylogenetic network is introduced, and an efficient
method for obtaining a height function $h_\tau$ for a given set of
rooted triplets $\tau$ is explained. It is shown that consistency of a
rooted phylogenetic network $N$ with the height
function $h_\tau$ can be a good alternative to
consistency of $N$ with $\tau$. To show this, we first define the Integer
Programming system in such a way that its constraints intuitively
force the consistency of $N$ with $\tau$. Second, we show that if $\tau$ is
consistent with a tree $T$, then $T$ is consistent with $h_\tau$ and $T$ can be
constructed using this height function. In the last section we
also present the TripNet algorithm.



Preliminaries 

Here we first present some definitions and notation. Then we
describe the BUILD algorithm. Finally, a new method called TCD is
introduced for constructing rooted triplets from a set of sequences.

2.1 Definitions and notation 

Let $X$ be a set of taxa. A rooted phylogenetic tree (tree for short)
on $X$ is a rooted unordered leaf-labeled tree whose leaves are
distinctly labeled by $X$ and in which every node that is not a leaf
has outdegree at least two. A directed acyclic graph (DAG) is a directed
graph that is free of directed cycles. A DAG $G$ is connected if there
is an undirected path between any two nodes of $G$. It is biconnected
if it contains no node whose removal disconnects $G$. A biconnected
component of a graph $G$ is a maximal biconnected subgraph of $G$.
A rooted phylogenetic network (network for short) on $X$ is a rooted
DAG in which the root has indegree 0 and outdegree 2 and every
node except the root satisfies one of the following conditions:

a) It has indegree 2 and outdegree 1. These nodes are called
reticulation nodes.

b) It has indegree 1 and outdegree 2.

c) It has indegree 1 and outdegree 0. These nodes are called
leaves and are distinctly labeled by $X$.

A reticulation leaf is a leaf whose parent is a reticulation node. A
network is said to be a level-$k$ network if each of its biconnected
components contains at most $k$ reticulation nodes. A tree can be
considered as a level-0 network.

A rooted triplet (triplet for short) is a rooted binary unordered
tree with three leaves. We use $ij|k$ to denote a triplet with taxa $i$
and $j$ on one side and $k$ on the other side of the root (Figure 1a). A
set of triplets $\tau$ is called dense if for each subset of three taxa there
is at least one triplet in $\tau$. A triplet $ij|k$ is consistent with a network
$N$, or equivalently $N$ is consistent with $ij|k$, if the leaf set of $ij|k$ is a
subset of the leaf set of $N$ and $N$ contains a subdivision of $ij|k$, i.e.
if $N$ contains distinct nodes $u$ and $v$ and pairwise internally
node-disjoint paths $u \to i$, $u \to j$, $v \to u$ and $v \to k$. Figure 1b shows an
example of a network consistent with $ij|k$. A set $\tau$ of triplets is
consistent with a network $N$ if all the triplets in $\tau$ are consistent
with $N$. We use the symbols $\tau(N)$ and $L_N$ to represent the set of all
triplets that are consistent with $N$ and the set of labels of its leaves,
respectively. For any set $\tau$ of triplets define $L(\tau) = \bigcup_{t \in \tau} L_t$.
The set $\tau$ is called a set of triplets on $X$ if $L(\tau) = X$.

2.2 BUILD algorithm 

Let $\tau$ be a set of triplets. BUILD is a top-down algorithm that
constructs a tree consistent with $\tau$ if such a tree exists. The
algorithm is guided by the Aho graph.

Definition 1. (Aho graph) Let $X$ be a set of taxa and $\tau$ be a set
of triplets on $X$. The Aho graph $AG(\tau) = (V, E)$ associated with $\tau$ has
node set $V = X$, and any two nodes $i$ and $j$ are connected by an
edge in $E$ if and only if there exists a triplet $ij|k \in \tau$ [1].

BUILD algorithm: Given a non-empty set of rooted triplets $\tau$ on
$X$, the aim is to construct a rooted phylogenetic tree $T$ on $X$ that is
consistent with $\tau$, if one exists. If $AG(\tau)$ has only one connected
component, then the algorithm reports failure. Otherwise, for each node
set $U$ of a connected component of $AG(\tau)$, determine the set $\tau|_U$
of all triplets in $\tau$ whose leaves are in $U$ and
recursively compute the rooted phylogenetic subtree $T(\tau|_U)$, the
tree constructed by the BUILD algorithm consistent
with $\tau|_U$. Finally, create a root node $r$ and combine all computed
subtrees by connecting $r$ to the root of each of them [1]. For an
example see Figure 2.
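
The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build`, the encoding of a triplet ij|k as the tuple `(i, j, k)`, and the nested-tuple representation of the output tree are assumptions made for this example. The base cases for one or two taxa are also an addition, since the description above leaves them implicit.

```python
from collections import defaultdict


def build(taxa, triplets):
    """Return a nested-tuple tree consistent with `triplets`, or None.

    Assumed encoding: the triplet ij|k is the tuple (i, j, k), and every
    triplet passed in has all three leaves inside `taxa`.
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    if len(taxa) == 2:
        return (taxa[0], taxa[1])

    # Aho graph: i and j are adjacent whenever some triplet ij|k is present.
    adj = defaultdict(set)
    for i, j, _k in triplets:
        adj[i].add(j)
        adj[j].add(i)

    # Connected components of the Aho graph by depth-first search.
    components, seen = [], set()
    for start in taxa:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)

    if len(components) == 1:
        return None  # a single component: no consistent tree exists

    subtrees = []
    for comp in components:
        # Restrict the triplet set to the leaves of this component.
        restricted = [t for t in triplets if set(t) <= comp]
        sub = build(comp, restricted)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)  # a new root joining all computed subtrees
```

For instance, `build("abcde", [("b", "e", "a"), ("a", "c", "d"), ("d", "e", "b")])` splits the taxa into the components {a, c} and {b, d, e} at the top level, while the conflicting pair ab|c, bc|a makes the function return `None`.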
Figure 2. An example of BUILD algorithm for the given set {be|a, ac|d, de|b} of triplets.
doi:10.1371/journal.pone.0106531.g002



2.3 Triplets construction method 

There exist different methods, such as Maximum Parsimony or
Maximum Likelihood, for constructing triplets from (biological)
sequences [6]. In this section a method for constructing triplets is
presented. Suppose that $X$ is a set of $n$ taxa and $D = [D_{ij}]$ is an
$n \times n$ distance matrix on $X$. For each three taxa $i$, $j$, and $k \in X$,
with entries $D_{ij}$, $D_{ik}$, and $D_{jk}$, we assign the triplet $ij|k$ if
$D_{ij} < \min\{D_{ik}, D_{jk}\}$. We name this method Triplets Construction with
Distance, TCD for short. In this paper we use the TCD method for
constructing triplets.
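
As a minimal sketch of the TCD rule, the following Python function enumerates every three taxa and emits the triplet whose pair has a strictly smallest distance. The function name `tcd`, the dict-of-dicts layout of the distance matrix, and the tuple encoding `(i, j, k)` for ij|k are assumptions of this example. Ties produce no triplet, which is why the resulting set need not be dense.

```python
from itertools import combinations


def tcd(taxa, dist):
    """Triplets Construction with Distance.

    `dist[i][j]` is assumed to hold the symmetric distance D_ij.
    Returns a list of tuples (i, j, k), each standing for the triplet ij|k.
    """
    triplets = []
    for a, b, c in combinations(taxa, 3):
        # Try each of the three possible cherries on {a, b, c}.
        for i, j, k in ((a, b, c), (a, c, b), (b, c, a)):
            if dist[i][j] < min(dist[i][k], dist[j][k]):
                triplets.append((i, j, k))
    return triplets
```

Because the inequality is strict, at most one of the three candidate triplets can be emitted for any three taxa.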

Results 

In this section, to show the performance of TripNet on the
triplet sets obtained from TCD, we compare TripNet
with SIMPLISTIC. We also test TripNet on the Yeast data, the only
published triplet data obtained from biological data.

3.1 Comparing SIMPLISTIC and TripNet 

SIMPLISTIC is the only available software that can return a
rooted phylogenetic network consistent with
a given dense set of rooted triplets, but it does not give any
minimality guarantees [6].

SplitsTree is a valuable tool for constructing a special kind of
unrooted phylogenetic network from different types of input
data. This program converts a given set of sequences $X$ into a
distance matrix $D_X$ to compute the resulting network. The distance
matrix $D_X$ is reported as one of the outputs of SplitsTree [8].

Let $\tau_{D_X}$ be the set of triplets that is obtained from $D_X$ using
TCD, and consider it as the input for TripNet.
Table 1. SIMPLISTIC and TripNet network results.

Number of sequences                                               10      20      30      40
Number of samples                                                 40      40      40      40
Number of the TripNet networks in N_finite                        40      40      40      40
Number of the SIMPLISTIC networks in N_finite                     40      38      13      0
TripNet avg running time for networks in N_finite (sec)           1       1.75    200     775
SIMPLISTIC avg running time for networks in N_finite (sec)        1       306     2675    -
TripNet avg number of reticulations for networks in N_finite      0.65    2.275   7.4     15.825
SIMPLISTIC avg number of reticulations for networks in N_finite   2.325   6.95    11.275  -
TripNet avg level for networks in N_finite                        0.65    1.825   6.95    15.25
SIMPLISTIC avg level for networks in N_finite                     2.05    4.2     6.95    -

160 different sets of sequences are generated using TREEVOLVE. The parameters Number of samples, Number of sequences, and Length of sequence are adjusted,
and the other parameters are left at their default values. Number of sequences is 10, 20, 30, and 40. For each value of Number of sequences, Length of
sequence is 100, 200, 300, and 400. For each case Number of samples is set to 10. N_finite is the set of networks for which the running time is less than 6 hours.
doi:10.1371/journal.pone.0106531.t001



Note that $\tau_{D_X}$ is not necessarily dense, since for some three taxa
$i$, $j$, and $k$ we might have $(D_X)_{ij} = (D_X)_{jk} < (D_X)_{ik}$. In this case one of the
triplets $ij|k$ or $jk|i$ is assigned to $i$, $j$, and $k$ in order to obtain a dense set
of triplets as the input of SIMPLISTIC. Also, if
$(D_X)_{ij} = (D_X)_{ik} = (D_X)_{jk}$, then one of the three possible triplets
on $i$, $j$, and $k$ is assigned to them at random.

To perform the simulation, we generated 160 different sets of
sequences using TREEVOLVE. TREEVOLVE is a
software package that simulates the evolution of DNA sequences under a
coalescent model [9]. TREEVOLVE has many adjustable input
parameters. In this study we adjust
the Number of samples, the Number of sequences, and the Length
of sequence, and leave the other parameters at their default
values. The Number of sequences is 10, 20, 30, and
40. For each value of the Number of sequences, the Length
of sequence is 100, 200, 300, and 400. For each case the Number of
samples is set to 10.

In this study we ran both methods on a PC with an Intel
Dual-Core processor running at 1.80 GHz.

Figure 3. Resulting networks from Yeast triplets. (a) LEVEL2 algorithm result, (b) TripNet algorithm result.
doi:10.1371/journal.pone.0106531.g003
Figure 4. The steps of constructing $T_\tau$ from the given set $\tau$ = {kl|j, kl|i, jk|i, ...} using HBUILD. (a) The graph $G_\tau$, (b) The graph (G,h), (c)
Removing maximum weights from the graph (G,h), (d) Constructing $T_\tau$ using step c.
doi:10.1371/journal.pone.0106531.g004



We set a running time limit of 6 hours for both methods. Let
$N_{finite}$ be the set of networks for which the running time is less than
6 hours.

The results of the comparison between TripNet and SIMPLISTIC
on the three most important parameters, i.e. the running time of
both methods, the number of reticulation nodes, and the level of
the final networks, are shown in Table 1.



The results show that when the number of input taxa is 10, both
methods always return a network in at most one second. For
20 input taxa, in 5% of the cases SIMPLISTIC returns no
result in less than 6 hours. For the remaining 95% of the cases,
the SIMPLISTIC running time is on average 306 seconds, while
over all cases the TripNet running time is on average at most
2 seconds. When this parameter is increased to 30, in 67.5% of the
cases SIMPLISTIC is unable to return a network in less
than 6 hours. For the remaining 32.5% of the cases
SIMPLISTIC outputs a network in 2675 seconds on average, while over all
cases the TripNet running time is on average 200 seconds.
Moreover, when this parameter is set to 40, SIMPLISTIC fails in all
cases to return any network in less than 6 hours,
while TripNet outputs a network in 775 seconds on average.
In total, over all 160 input triplet sets TripNet outputs a
network in less than 250 seconds on average, while for the 57% of the
cases in which the SIMPLISTIC network belongs to $N_{finite}$, the average
running time is close to 750 seconds.

Figure 5. An example of binarization. The binary tree is a
binarization of the non-binary tree.
doi:10.1371/journal.pone.0106531.g005

Figure 6. Two different networks with the same height
function. For the given network N and tree T, $h_N = h_T = h$, where h(j,k) = 1,
h(i,j) = h(i,k) = 2 and h(i,l) = h(j,l) = h(k,l) = 3.
doi:10.1371/journal.pone.0106531.g006

Figure 7. A counter example for the reverse of Theorem 2. ij|k is
consistent with the given network, but h(i,j) = h(i,k) = i and h(j,k) = 2.
doi:10.1371/journal.pone.0106531.g007



The results also show that in all cases the number of
reticulation nodes and the level of the TripNet networks are less than
those of the SIMPLISTIC networks. Note that for 40 input taxa, the
average number of reticulation nodes and the average level of the
TripNet networks are 15.825 and 15.25, while for these data
SIMPLISTIC cannot return any network in less than 6 hours.

3.2 Yeast data 

The Yeast data is a dense set of triplets generated from real
yeast data obtained from the Fungal Biodiversity Center in
Utrecht. This data set, which contains information about 21 species,
is available online from (http://skelk.sdf-eu.org/level2triplets.html).
Based on the algorithm developed in [10], Steven Kelk
has developed a software application, called LEVEL2, for
constructing level-2 networks from dense sets of triplets. LEVEL2
is not applicable to general triplet sets, and it produces a network
only if there exists a level-2 network consistent with the input
triplets. However, LEVEL2 has the advantage that it always
produces the best possible network, which also minimizes the
number of reticulation nodes. The LEVEL2 network for the Yeast data
is a 21-leaf level-2 network, which is given in Figure 3a [10]. As this
was our only chance to compare TripNet networks with the best
possible networks, we repeated the analysis of the Yeast data using
TripNet. The TripNet network for the Yeast dataset is given in
Figure 3b. As one can see, TripNet produced a level-3 network
which contains only one more reticulation node than the network
obtained by LEVEL2. The running time of both algorithms is
nearly one second.

Discussion 

In this paper we introduced TripNet, a software tool that
can return a network consistent with an arbitrary
given set of triplets. TripNet and supplementary files are freely
available for download at (www.bioinf.cs.ipm.ir/software/tripnet).
Unlike previous methods, which only work on dense triplet sets,
our method works on any set of triplets. Some theorems were
proved to clarify the rationale behind the steps of TripNet. In this
paper the TCD method was introduced for constructing triplets.
In order to study the performance of TripNet on triplets
obtained from the TCD method, we performed a simulation on
160 different sets of triplets and compared TripNet with
SIMPLISTIC.

The results showed that in all 160 cases TripNet outputs an
appropriate network in an acceptable time, while SIMPLISTIC is able
to return a network in less than 6 hours in only about 57% of
these cases. Also, on average TripNet outperforms
SIMPLISTIC in all cases on the number of reticulation nodes and the
level of the output network.

Moreover, as the number of input taxa increases, the running time of
SIMPLISTIC increases abruptly, such that for 40 input taxa it
could not return any network in less than 6 hours.

These results show that for large input data
obtained from the TCD method, SIMPLISTIC is not a practical
method for constructing networks, while TripNet works well in all
cases.

To establish the performance of TripNet on real datasets, we
tested TripNet on the Yeast data and compared our results with those
of LEVEL2. For the Yeast data TripNet produced a level-3 network
which contains only one more reticulation node than the optimal
network obtained by LEVEL2. Both networks have the same
clustering and represent the same evolutionary relationships
between taxa. While TripNet has been designed for general
triplet sets (not necessarily dense or consistent with a restricted
level network), this example shows that the network produced by
TripNet is very close to the best possible solution.

Figure 8. An example to show how TripNet works to find a reticulation leaf by applying step 5. Edges with weight 6 are shown by dotted
lines. (a) $\tau$ = {ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n} is not consistent with a tree and is consistent with the given level-1
network, (b) $G'_\tau$ is obtained from $G_\tau$ by removing the dotted line, (c) Computing (G, h), (d) Remove edges with weights 6 and 5 from (G, h) to obtain
SN-sets {n, o} and {m}, (e) Remove edges with weights 4 and 3 from the remaining graph to obtain SN-set {/}, (f) Remove edges with weights 2 from
the remaining graph to obtain SN-set {/}, (g) Remove edges with weights 1 from the remaining graph to obtain SN-sets {k} and {/}, (h) Compute $G_s$.
Both SN-sets {k} and {/} satisfy Criteria I and II, (i) Remove {/} from $G_s$, (j) Remove edges with weights 6 from the graph of the previous step to obtain
SN-sets {k} and {m, n, o}, (k) Remove {k} from $G_s$, (l) Remove edges with weights 6 and 5 from the graph of the previous step to obtain SN-sets {n, o} and
{m}, (m) Remove edges with weights 4 from the remaining graph to obtain SN-set {/}, (n) Remove edges with weights 3 from the remaining graph to
obtain SN-sets {/} and {/}. Steps i to n show that / is the reticulation leaf. In these steps Criterion III is applied.
doi:10.1371/journal.pone.0106531.g008

Figure 9. Steps of TripNet for input triplets: {jk|i, mj|i, jn|i, kl|i, ik|m, ik|n, lm|i, ln|i, mn|i, kl|j, km|j, jn|k, ln|j, jl|n, mn|j, kl|m, kl|n,
mn|k, mn|l}. (i) The (G, h) graph. (ii) Remove edges with weights 5 and 4. Continue the process on non SN-set {j, k, l, m, n}.
(iii) Final SN-sets: {i}, {j}, {k, l}, {m, n}. (iv) Updated graph on SN-sets. Updated triplets: {k, l}{m, n}|i, {k, l}{m, n}|j,
j{m, n}|{k, l}, j{k, l}|{m, n}, j{m, n}|i, i{k, l}|{m, n}, j{k, l}|i, i{k, l}|j. (v) Remove first and second selected reticulation
leaves {m, n} and {k, l} to obtain a tree. (vi) Add {k, l} to the network. (vii) Add {m, n} to the network. (viii) Replace SN-sets.
doi:10.1371/journal.pone.0106531.g009

PLOS ONE | www.plosone.org | September 2014 | Volume 9 | Issue 9 | e106531

TripNet: A Method for Constructing Networks from Triplets

Materials and Methods 

In this section we prove some theorems to clarify the rationale 
behind the steps of TripNet. Then TripNet is presented in nine 
steps. 

5.1 The directed graph related to a set of triplets and height function

Throughout this subsection we denote $\{i, j\}$ by $ij$ for short. Let $\tau$
be a set of triplets. Define $G_\tau$, the directed graph related to $\tau$, by
$V(G_\tau) = \{ij : i, j \in L(\tau), i \neq j\}$ and
$E(G_\tau) = \{(ij, ik) : ij|k \in \tau\} \cup \{(ij, jk) : ij|k \in \tau\}$.
In the following we present some basic properties of $G_\tau$.
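
To make the definition concrete, the sketch below builds the directed graph related to a triplet set and tests whether it is acyclic by repeatedly stripping nodes of outdegree zero. It is an illustration only: the names `triplet_digraph` and `is_dag`, the use of `frozenset` pairs as nodes, and the tuple encoding `(i, j, k)` for the triplet ij|k are assumptions of this example.

```python
from itertools import combinations


def triplet_digraph(triplets):
    """Map each unordered pair of leaves to the set of its out-neighbours.

    Each triplet ij|k, encoded as (i, j, k), contributes the two edges
    ij -> ik and ij -> jk.
    """
    leaves = sorted({x for t in triplets for x in t})
    edges = {frozenset(p): set() for p in combinations(leaves, 2)}
    for i, j, k in triplets:
        ij = frozenset((i, j))
        edges[ij].add(frozenset((i, k)))
        edges[ij].add(frozenset((j, k)))
    return edges


def is_dag(edges):
    """Return True if the graph has no directed cycle."""
    remaining = {v: set(out) for v, out in edges.items()}
    while remaining:
        sinks = {v for v, out in remaining.items() if not out}
        if not sinks:
            return False  # every remaining node has a successor: a cycle
        for v in sinks:
            del remaining[v]
        for out in remaining.values():
            out -= sinks
    return True
```

For example, the conflicting triplets ab|c and bc|a give the edges ab -> bc and bc -> ab, so `is_dag` returns `False` for them.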

In what follows the height function of a tree is introduced. Let
$\binom{X}{2}$ denote the set of all subsets of $X$ of size 2.

Definition 2. Let $X$ be an arbitrary finite set. A function
$h: \binom{X}{2} \to \mathbb{N}$ is called a height function on $X$.

Let $T$ be a rooted tree with root $r$, let $c_{ij}$ be the lowest common
ancestor of the leaves $i$ and $j$, and let $l_T$ denote the length of a longest
path starting at $r$.

Definition 3. The height function of $T$, $h_T$, is defined as
$h_T(i,j) = l_T - d_T(r, c_{ij})$, where $i$ and $j$ are two distinct leaves of $T$
($d_T(r, c_{ij})$ denotes the length of the path between $r$ and $c_{ij}$).

Let $T$ be a tree. The definition above implies that a triplet $ij|k$ is
consistent with $T$ if and only if $h_T(i,j) < h_T(i,k)$ or $h_T(i,j) < h_T(j,k)$.
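
The height function of a tree can be computed directly from its definition. The following sketch assumes, purely for illustration, that a tree is given as nested tuples with leaf labels at the tips; the names `tree_height_function` and `triplet_in_tree` are hypothetical.

```python
from itertools import combinations


def tree_height_function(tree):
    """Return {frozenset({i, j}): l_T - d_T(r, c_ij)} for a nested-tuple tree."""

    def longest(t):
        # Length (in edges) of a longest path from node t down to a leaf.
        return 0 if not isinstance(t, tuple) else 1 + max(longest(c) for c in t)

    def leaves(t):
        return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves(c)]

    l_t = longest(tree)
    h = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            return
        groups = [leaves(child) for child in node]
        # Leaves taken from two different children have `node` as their
        # lowest common ancestor, which lies at distance `depth` from the root.
        for a, b in combinations(range(len(groups)), 2):
            for x in groups[a]:
                for y in groups[b]:
                    h[frozenset((x, y))] = l_t - depth
        for child in node:
            visit(child, depth + 1)

    visit(tree, 0)
    return h


def triplet_in_tree(h, i, j, k):
    """Check the criterion h(i,j) < h(i,k) or h(i,j) < h(j,k) for ij|k."""
    hij = h[frozenset((i, j))]
    return hij < h[frozenset((i, k))] or hij < h[frozenset((j, k))]
```

For the caterpillar tree `((("a", "b"), "c"), "d")` this gives height 1 for the pair ab, 2 for ac and bc, and 3 for every pair containing d, so ab|c passes the test while ac|b does not.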

Let $X = \{x_1, x_2, \ldots, x_m\}$ be a finite set, $D$ be a distance matrix on
$X$, and $\tau$ be the set of triplets on $X$ that are obtained from the TCD
method using $D$. Suppose that $G_\tau$ contains a cycle
$x_1x_2 \to x_2x_3 \to \cdots \to x_{n-1}x_n \to x_1x_2$. Then
$D_{x_1x_2} < D_{x_2x_3} < \cdots < D_{x_{n-1}x_n} < D_{x_1x_2}$, which
is a contradiction. So $G_\tau$ is a DAG.

Moreover, if $\tau$ is a triplet set consistent with a tree $T$, then $G_\tau$ is a
DAG. This is so because if $G_\tau$ contains a cycle
$x_1x_2 \to x_2x_3 \to \cdots \to x_{n-1}x_n \to x_1x_2$, then
$h_T(x_1,x_2) < h_T(x_2,x_3) < \cdots < h_T(x_{n-1},x_n) < h_T(x_1,x_2)$,
which is a contradiction.

The height function of a DAG is introduced as follows.

Let $\tau$ be a set of triplets such that $G_\tau$ is a DAG, and let $l_{G_\tau}$ denote
the length of the longest path in $G_\tau$. Since $G_\tau$ is a DAG, the set of nodes with
outdegree zero is nonempty. Assign $l_{G_\tau} + 1$ to the nodes with
outdegree zero and remove them from $G_\tau$. Assign $l_{G_\tau}$ to the nodes
with outdegree zero in the resulting graph and continue this
procedure until all nodes are removed.

Definition 4. For any two distinct $i, j \in L(\tau)$, define $h_{G_\tau}(i,j)$ as
the value that is assigned by the above procedure to the node $ij$
and call it the height function related to $G_\tau$.
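
The peeling procedure behind Definition 4 can be sketched as follows. The function name `height_function`, the tuple encoding `(i, j, k)` for the triplet ij|k, and the `frozenset` keys are assumptions of this example; the function rebuilds the directed graph itself so that it can be run on its own.

```python
from itertools import combinations


def height_function(triplets):
    """Return {frozenset({i, j}): height} by peeling layers of sinks."""
    leaves = sorted({x for t in triplets for x in t})
    out = {frozenset(p): set() for p in combinations(leaves, 2)}
    for i, j, k in triplets:
        out[frozenset((i, j))] |= {frozenset((i, k)), frozenset((j, k))}

    # Repeatedly remove the nodes with outdegree zero, remembering the
    # order in which the layers were removed.
    layers = []
    while out:
        sinks = {v for v, succ in out.items() if not succ}
        if not sinks:
            raise ValueError("the directed graph contains a cycle")
        layers.append(sinks)
        out = {v: succ - sinks for v, succ in out.items() if v not in sinks}

    # A longest path has len(layers) - 1 edges, so the first layer gets
    # the value (longest path length) + 1 = len(layers), the next one less,
    # and the last layer gets 1.
    return {v: len(layers) - depth
            for depth, layer in enumerate(layers)
            for v in layer}
```

On the triplets ab|c, ab|d, ac|d, bc|d the pairs containing d are removed first and receive 3, then ac and bc receive 2, and finally ab receives 1.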

Let $\tau$ be a set of triplets that is consistent with a tree, and let $T_\tau$
denote the unique tree that is produced by the BUILD algorithm.
Then $G_\tau$ is a DAG and $h_{G_\tau}$ is well-defined. The following theorem
gives an upper bound for $h_{T_\tau}$ based on $h_{G_\tau}$.

Theorem 1. Let $\tau$ be a set of triplets that is consistent with a
tree. Then $h_{T_\tau} \le h_{G_\tau}$.

Proof. The proof proceeds by induction on $|L_\tau|$. It is trivial
when $|L_\tau| = 3$. Assume that the theorem holds when $|L_\tau| \le k$. Let
$|L_\tau| = k+1$ and let $T_1, T_2, \ldots, T_m$ be the $m$ subtrees which are obtained
from $T_\tau$ by removing its root. For each $i$, $1 \le i \le m$, let $\tau_i = \tau|_{L_{T_i}}$ and
let $r_i$ be the root of $T_i$. By the induction assumption, for each $i$,
$1 \le i \le m$, $h_{T_{\tau_i}} \le h_{G_{\tau_i}}$. Moreover, we conclude from the BUILD algorithm
that $T_i = T_{\tau_i}$ for $1 \le i \le m$. Thus $h_{T_i} \le h_{G_{\tau_i}}$ for $1 \le i \le m$. Also, for $i$,
$1 \le i \le m$, the maximum length of the longest path in $T_i$ is $l_{T_\tau} - 1$. It
means that for $i$, $1 \le i \le m$, the maximum length of the longest path
in $G_{\tau_i}$ is at least $l_{T_\tau} - 2$. Therefore the length of the longest path in
$G_\tau$ is at least $l_{T_\tau} - 1$. Let $a, b \in L_\tau$. We have two cases.

Case 1. For some $i$ and $j$, $1 \le i < j \le m$, $a \in L_{T_i}$ and $b \in L_{T_j}$. Since
the outdegree of $ab$ in $G_\tau$ is zero and $c_{ab} = r$, we have
$h_{T_\tau}(a,b) = l_{T_\tau} \le h_{G_\tau}(a,b)$.

Case 2. For some $i$, $1 \le i \le m$, $a, b \in L_{T_i}$. By the induction
assumption, $h_{T_{\tau_i}}(a,b) \le h_{G_{\tau_i}}(a,b)$ for $i$, $1 \le i \le m$. Therefore
$h_{T_\tau}(a,b) = l_{T_\tau} - d_{T_\tau}(r, c_{ab}) = l_{T_\tau} - (d_{T_{\tau_i}}(r_i, c_{ab}) + 1)$
$= (l_{T_\tau} - l_{T_{\tau_i}} - 1) + (l_{T_{\tau_i}} - d_{T_{\tau_i}}(r_i, c_{ab}))$
$= (l_{T_\tau} - l_{T_{\tau_i}} - 1) + h_{T_{\tau_i}}(a,b) \le (l_{T_\tau} - l_{T_{\tau_i}} - 1) + h_{G_{\tau_i}}(a,b) \le h_{G_\tau}(a,b)$.
The last inequality is obtained by construction of $G_\tau$ from
$G_{\tau_i}$ for $i$, $1 \le i \le m$.

So for each $a, b \in L_\tau$, $h_{T_\tau}(a,b) \le h_{G_\tau}(a,b)$ and the proof is
complete.

Now we describe an algorithm similar to the BUILD algorithm,
using height functions. We refer to this algorithm as HBUILD.
Let $h$ be a height function on $X$. Define a weighted complete graph
$(G,h)$ where $V(G) = X$ and edge $\{i,j\}$ has weight $h(i,j)$. Remove the
edges with maximum weight from $G$. If removing these edges
results in a connected graph, the algorithm stops. Otherwise, the
process of removing the edges with maximum weight is continued
in each connected component until each connected component
contains only one node. At the end of this procedure one can
reconstruct the tree by reversing the steps of the algorithm, similarly
to the BUILD algorithm (see Figure 4). The algorithm above decides
in polynomial time whether a tree with height function $h$ exists.
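
The HBUILD procedure can be sketched as a short recursion. This is an illustrative reading of the description above rather than the authors' code: the names `hbuild`, `_split` and `_components`, the `frozenset`-keyed weight dictionary, and the nested-tuple output are assumptions, and returning `None` when the graph stays connected is this sketch's way of signalling that the algorithm stops without a tree.

```python
from itertools import combinations


def _components(nodes, edges):
    """Connected components of the undirected graph (nodes, edges)."""
    adj = {v: set() for v in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    comps, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def _split(nodes, edges):
    if len(nodes) == 1:
        return nodes[0]
    top = max(edges.values())
    # Remove every edge of maximum weight.
    kept = {e: w for e, w in edges.items() if w < top}
    comps = _components(nodes, kept)
    if len(comps) == 1:
        return None  # still connected: the algorithm stops
    subtrees = []
    for comp in comps:
        sub = _split(sorted(comp), {e: w for e, w in kept.items() if e <= comp})
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


def hbuild(taxa, h):
    """`h` maps frozenset({i, j}) to the weight h(i, j) for every pair of taxa."""
    nodes = sorted(taxa)
    edges = {frozenset(p): h[frozenset(p)] for p in combinations(nodes, 2)}
    return _split(nodes, edges)
```

With the weights 1 for ab, 2 for ac and bc, and 3 for every pair containing d, the sketch returns `((("a", "b"), "c"), "d")`; with weights 1 for ab and bc but 2 for ac, removing the heaviest edge leaves the graph connected and the result is `None`.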

So if $\tau$ is a set of triplets which is consistent with a tree, then $G_\tau$
is a DAG, $h_{T_\tau}(a,b) \le h_{G_\tau}(a,b)$, and with $h = h_{G_\tau}$ the HBUILD
algorithm constructs a tree consistent with $\tau$. Note that, based on Theorem 1,
the tree that is produced by HBUILD is exactly $T_\tau$.

The HBUILD tree is not necessarily a binary tree. To obtain a
binary tree consistent with a set of triplets, we apply the following
procedure.

Let $T$ be a tree and $x$ be a node of $T$ with $x_1, x_2, \ldots, x_k$, $k \ge 3$, as
its children. Consider a new node $y$. Construct $T'$ by removing the
edges $(x, x_1), (x, x_2), \ldots, (x, x_{k-1})$ from $T$ and adding the edges $(x, y)$,
$(y, x_1), (y, x_2), \ldots, (y, x_{k-1})$ to $T$. Continuing the same method for
each node with outdegree more than 2, a binary tree is obtained;
we call it a binarization of $T$ (see Figure 5). Obviously, one can
obtain different binarizations of $T$. Let $\tau$ be a set of triplets that is
consistent with a tree $T_1$, and let $T_2$ be a binarization of $T_1$. Then $\tau$ is
consistent with $T_2$.
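
One possible binarization can be written as a short recursion on nested tuples. The tree encoding and the name `binarize` are assumptions of this sketch, and grouping the first k-1 children under the new node is only one of the admissible choices.

```python
def binarize(tree):
    """Return one binarization of a nested-tuple tree.

    A node with children x_1, ..., x_k (k >= 3) keeps x_k and receives a
    new child y that adopts x_1, ..., x_{k-1}; y is then binarized in turn.
    """
    if not isinstance(tree, tuple):
        return tree  # a leaf
    children = [binarize(c) for c in tree]
    if len(children) <= 2:
        return tuple(children)
    y = binarize(tuple(children[:-1]))
    return (y, children[-1])
```

For example, the star tree `("a", "b", "c", "d")` becomes `((("a", "b"), "c"), "d")`, and a tree that is already binary is returned unchanged.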

In the remainder of this section we generalize the concept of
height function from trees to networks. This generalization is not
straightforward because the concept of (lowest) common ancestor
of two leaves of a network is not well-defined. Let $N$ be a network
with root $r$ and let $l_N$ be the length of a longest directed path from
$r$ to the leaves. For each node $u$, let $d(r,u)$ be the length of the
longest directed path from $r$ to $u$. For any two nodes $u$ and $v$, we
call $u$ an ancestor of $v$ if there is a directed path from $u$ to $v$. If $u$ is
an ancestor of $v$, then we say that $v$ is lower than $u$. Let $i$ and $j$ be
two leaves of $N$. A node $c$ is called a lowest common ancestor of $i$ and $j$ in $N$
if $c$ is a common ancestor of $i$ and $j$ and there is no common
ancestor of $i$ and $j$ lower than $c$. For any two leaves $i$ and $j$, let $C_{ij}$
denote the set of all lowest common ancestors of $i$ and $j$.

Definition 5. For each pair of leaves $i$ and $j$, define
$h_N(i,j) = \min\{l_N - d(r,c) : c \in C_{ij}\}$ and call it the height function of $N$.

Obviously, every network $N$ determines a unique height function
$h_N$. But two different networks may have the same height function
(see Figure 6).

In the following proposition we prove that for a given height
function $h$ there is a network $N$ such that $h_N = h+1$.

Proposition 1. Let $X$ be an arbitrary finite set and $h$ be a
height function on $X$. Then there exists a network $N$, not
necessarily binary, such that its leaves are distinctly labeled by $X$
and $h_N = h+1$.

Proof. Let $X = \{x_1, x_2, \ldots, x_n\}$ and $h_{max} = \max\{h(x_i, x_j) :
1 \le i, j \le n\}$. Let $r$ be the root of $N$, and $X' = \{x'_1, x'_2, \ldots, x'_n\}$.
Consider $n$ nodes that are distinctly labeled by the members of $X'$. For
each pair of nodes $x_i$ and $x_j$ with $h(x_i, x_j) = h_{max}$, connect $x'_i$ and
$x'_j$ to $r$ by two paths of length $h_{max}$ which have only the root
in common. For each pair of nodes $x_i$ and $x_j$ with $h(x_i, x_j) < h_{max}$, consider
a new node, connect $x'_i$ and $x'_j$ to this new node, and connect
this node to $r$ by a path of length $h_{max} - h(x_i, x_j)$. For each node
which is labeled by $x'_i$, consider a new node as its child and label it
by $x_i$. The resulting network, whose leaves are distinctly
labeled by $X$, satisfies the condition $h_N = h+1$.

Note that the network $N$ constructed in the proof of
Proposition 1 is not necessarily a rooted phylogenetic network. To
construct a rooted phylogenetic network $N'$ from $N$ in such a way
that every triplet consistent with $N$ is also consistent with $N'$,
apply the following procedure. Replace each path all of whose inner
nodes have indegree and outdegree one with a path of length one.
The method of constructing $N$ shows that if there is a node $v$ with
indegree $d > 2$, then it has just one child, which is a leaf. Let this
child be labeled by $x$, let $d \ge 3$, and let the $d$ parents of $v$ be
labeled by $x_1, x_2, \ldots, x_d$. Replace the edge which is connected to
$x$ with a path of length $d-2$ in such a way that its $d-2$ inner nodes
from $v$ to $x$ are labeled with 1 to $d-2$. For each $i$,
$1 \le i \le d-2$, remove the edge $x_i v$ and connect $x_i$ to $i$. Do
the binarization on the root. The resulting network $N'$ is
consistent with all triplets which are consistent with $N$.

The following theorem shows the relation between the height
function of a network and a triplet consistent with it.

Theorem 2. Let $N$ be a network and $i$, $j$, and $k$ be three distinct
leaves of $N$. If $h_N(i,j) < h_N(i,k)$ or $h_N(i,j) < h_N(j,k)$ then $ij|k$ is
consistent with $N$.

Proof. Suppose that $h_N(i,j) < h_N(i,k)$. Let $v_{ij}$ and $v_{ik}$ be
common ancestors of $i, j$ and $i, k$, respectively, such that
$h_N(i,j) = l_N - d(r, v_{ij})$ and $h_N(i,k) = l_N - d(r, v_{ik})$. Let $l_i$
and $l_j$ be two distinct paths from $v_{ij}$ to $i$ and $j$, respectively.
Let $l_k$ be an arbitrary path from $v_{ik}$ to $k$. If
$l_i \cap l_k \neq \emptyset$ then it follows that $h_N(i,j) \ge h_N(i,k)$,
which is a contradiction. So $ij|k$ is consistent with $N$.

The converse of the above theorem is not necessarily true. For
example, consider the network of Figure 7. The triplet $ij|k$ is
consistent with it, but $h(i,j) = h(i,k) = 3$ and $h(j,k) = 2$.

The basic idea of the TripNet algorithm is to find a height function
as an intermediate computational step that yields the minimum
amount of information required to construct the network from a
set of triplets. So it is important to find a way of computing $h_N$
from a set of triplets. In the rest of this section we introduce a
method for computing $h_N$ using Integer Programming. Let $\tau$ be a
set of triplets with $|L(\tau)| = n$. Inspired by the two inequalities
that are the consequence of Definition 3 and Theorem 2, for each
triplet $ij|k \in \tau$, define two inequalities $h(i,k) - h(i,j) \ge 1$ and
$h(j,k) - h(i,j) \ge 1$. Since the number of variables in such
inequalities is at most $\binom{n}{2}$, we obtain the following system
of inequalities from $\tau$.



$$
\begin{cases}
h(i,k) - h(i,j) \ge 1, & ij|k \in \tau,\\
h(j,k) - h(i,j) \ge 1, & ij|k \in \tau,\\
0 < h(i,j) \le \binom{n}{2}, & 1 \le i,j \le n.
\end{cases}
$$



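Let $s$ be an integer. Define the following Integer Programming
problem and call it IP$(\tau, s)$.

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{1 \le i,j \le n} h(i,j) \\
\text{Subject to:} \quad & h(i,k) - h(i,j) \ge 1, && ij|k \in \tau, \\
& h(j,k) - h(i,j) \ge 1, && ij|k \in \tau, \\
& 0 < h(i,j) \le s, && 1 \le i,j \le n.
\end{aligned}
$$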

Intuitively, if IP$(\tau, s)$ has a feasible solution, we expect that the
optimal solution to this integer program is an approximation
of the height function of an optimal network $N$ consistent with $\tau$.
The following theorems support this intuition.
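
For experimentation, this integer program can be solved without an IP solver, because every constraint is a difference constraint. The sketch below is our own illustration and not necessarily how TripNet is implemented: it treats each pair $\{i,j\}$ as a node, each inequality as a directed edge (which we assume coincides with the graph $G_\tau$ defined earlier), and sets $h$ to $s$ minus the length of the longest chain of constraints starting at the pair. This makes every variable as large as the constraints allow, so it also maximizes the sum. The function name `solve_ip` and the tuple encoding of triplets are assumptions.

```python
def solve_ip(triplets, s):
    """Return the sum-maximizing h as {frozenset({i, j}): value}, or None.

    `triplets` is an iterable of (i, j, k) tuples, each meaning ij|k.
    Pairs that occur in no triplet are unconstrained (they would simply
    take the value s) and are left out of the result.
    """
    succ, nodes = {}, set()
    for i, j, k in triplets:
        ij, ik, jk = frozenset((i, j)), frozenset((i, k)), frozenset((j, k))
        # h(i,k) - h(i,j) >= 1 and h(j,k) - h(i,j) >= 1
        succ.setdefault(ij, set()).update((ik, jk))
        nodes.update((ij, ik, jk))

    longest, on_stack = {}, set()

    def walk(u):
        """Length of the longest chain of constraints starting at u."""
        if u in longest:
            return longest[u]
        if u in on_stack:
            raise ValueError("cyclic constraints")
        on_stack.add(u)
        longest[u] = max((walk(v) + 1 for v in succ.get(u, ())), default=0)
        on_stack.discard(u)
        return longest[u]

    try:
        for u in nodes:
            walk(u)
    except ValueError:
        return None  # a cycle would force h(u) < h(u), so no s is feasible

    h = {u: s - longest[u] for u in nodes}
    # The lower bound 0 < h(i,j) fails when s is too small.
    return h if all(v >= 1 for v in h.values()) else None


single = [("i", "j", "k")]          # the triplet ij|k
too_small = solve_ip(single, 1)     # infeasible
feasible = solve_ip(single, 2)      # h(i,j)=1, h(i,k)=h(j,k)=2
```

For the single triplet $ij|k$ the longest chain has length 1, so the smallest feasible $s$ is 2.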

Theorem 3. Let $\tau$ be a set of triplets. Then $G_\tau$ is a DAG if and
only if for some integer $s$, IP$(\tau, s)$ has a feasible solution. In this
case the minimum number $s$ for which IP$(\tau, s)$ has a feasible
solution is $l_{G_\tau} + 1$.

Proof. Let $G_\tau$ be a DAG. Without loss of generality assume
that $G_\tau$ is connected.

The proof proceeds by induction on $l_{G_\tau}$. If $l_{G_\tau} = 1$ then
obviously for $s = 1$, IP$(\tau, s)$ has no feasible solution and for each
$s \ge 2$, IP$(\tau, s)$ has a feasible solution. Assume that the theorem
holds for $l_{G_\tau} \le k$. Suppose that $\tau$ is a set of triplets with
$l_{G_\tau} = k+1$. Let $A$ be the set of the terminal nodes of all longest
paths in $G_\tau$. For each $ij \in A$ there is some $x \in L(\tau)$ such
that $ix|j \in \tau$. Let $B$ be the set of all such triplets and
$\tau' = \tau \setminus B$. Apparently, $B \neq \emptyset$ and the length of
the longest path in $G_{\tau'}$ is $k$. By the induction assumption the
minimum number $s$ for which IP$(\tau', s)$ has a feasible solution is
$l_{G_{\tau'}} + 1 = l_{G_\tau}$. Let $h'$ be a feasible solution to
IP$(\tau', l_{G_\tau})$ and consider IP$(\tau, l_{G_\tau} + 1)$. Define
$h(i,j) = l_{G_\tau} + 1$ for each $ij \in A$ and $h(t,l) = h'(t,l)$ for each
$tl \notin A$. Then $h$ is a feasible solution to IP$(\tau, l_{G_\tau} + 1)$.
Now if IP$(\tau, s)$ has a feasible solution then IP$(\tau', s-1)$ has a
feasible solution. So $l_{G_\tau} + 1$ is the minimum $s$ for which
IP$(\tau, s)$ has a feasible solution. Now suppose that $\tau$ is a set of
triplets and for some integer $s$, IP$(\tau, s)$ has a feasible solution
$h$. Assume that $G_\tau$ has a cycle
$C = i_1 j_1 \to i_2 j_2 \to \cdots \to i_m j_m \to i_1 j_1$.
Corresponding to $C$ we have the inequalities
$h(i_1, j_1) < h(i_2, j_2) < \cdots < h(i_m, j_m) < h(i_1, j_1)$, which is a
contradiction, and the proof is complete.

Let $\tau$ be a set of triplets that is consistent with a tree or
constructed from a given set of taxa using the TCD method. It was
shown that $G_\tau$ is a DAG and, by Theorem 3, $h_{G_\tau}$ is a feasible
solution to IP$(\tau, l_{G_\tau} + 1)$.

Theorem 4. Let $\tau$ be a set of triplets consistent with a tree.
Then $h_{G_\tau}$ is the unique optimal solution to IP$(\tau, l_{G_\tau} + 1)$.

Proof. The graph $G_\tau$ is a DAG, since $\tau$ is consistent with a tree.
So $l_{G_\tau}$ is well defined.

The proof proceeds by induction on $l_{G_\tau}$. Without loss of
generality assume that $G_\tau$ is connected. The theorem is trivial
when $l_{G_\tau} = 1$. Assume that for each set of triplets $\tau$ consistent
with a tree and with $l_{G_\tau} = k \ge 1$, $h_{G_\tau}$ is the unique optimal
solution to IP$(\tau, l_{G_\tau} + 1)$. Suppose that $\tau$ is a set of
triplets consistent with a tree and $l_{G_\tau} = k+1$. Let $\tau'$ be the
set of triplets introduced in the proof of Theorem 3. By the
induction assumption $h_{G_{\tau'}}$ is the unique optimal solution to
IP$(\tau', l_{G_{\tau'}} + 1)$. By Theorem 3 the minimum $s$ for which
IP$(\tau, s)$ has a feasible solution is $l_{G_\tau} + 1$. Also
$l_{G_{\tau'}} + 1 = l_{G_\tau}$. It follows that $h_{G_\tau}$ is the unique
optimal solution to IP$(\tau, l_{G_\tau} + 1)$ and the proof is complete.

It is important to point out that the target function of the above
IP can be replaced with other appropriate target functions. We use
this particular target function for two reasons. First, a solution
to this IP can be found in polynomial time when the input triplets
are obtained from the TCD method. Second, this target function
enables us to prove the above theorems, which show that the result
of the TripNet algorithm is consistent with a tree when there is a
tree consistent with the given triplets.

5.2 TripNet algorithm 

Now we describe the TripNet algorithm in nine steps. In this
algorithm the input is a set of triplets $\tau$ and the output is a
network consistent with $\tau$. Also, if $\tau$ is consistent with a tree,
the algorithm constructs a binarization of $T_\tau$.

Step 1. In this step we find a height function $h$ on $L(\tau)$. If
$G_\tau$ is a DAG we set $G'_\tau = G_\tau$. If $G_\tau$ is not a DAG we remove
some edges from $G_\tau$ in such a way that the resulting graph $G'_\tau$
is a DAG. Set $h = h_{G'_\tau}$.

If $\tau$ is obtained from a set of taxa using the TCD method, then
$G_\tau$ is a DAG. Removing the minimum number of edges from a
directed graph to make it a DAG is known as the minimum Feedback
Arc Set problem, which is NP-hard [11]. Thus we use the following
heuristic and try to remove as few edges as possible from $G_\tau$ in
order to lose as little information as possible. First a cycle $C$ is
selected randomly. Let $C_{\max}$ denote the set of nodes in $C$ with the
maximum degree. Remove an edge of $C$ one of whose ends belongs to
$C_{\max}$. This process continues until the resulting graph is a DAG.
Any information lost in this way will be recaptured in Step 9.
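
The cycle-breaking heuristic of Step 1 can be sketched as follows. This is an illustration under several assumptions of ours, not the TripNet source: the graph is given as a list of directed edges, "degree" is taken to be indegree plus outdegree in the current graph, a cycle is found by a depth-first search from a randomly shuffled start order (so cycles are not sampled uniformly), and ties among candidate edges are broken at random. The names `find_cycle` and `break_cycles` are hypothetical.

```python
import random


def find_cycle(succ, rng):
    """Return the nodes of some directed cycle in order, or None."""
    nodes = list(succ)
    rng.shuffle(nodes)
    done = set()  # nodes from which no cycle is reachable

    def dfs(u, path):
        path.append(u)
        for v in succ[u]:
            if v in path:                 # back edge closes a cycle
                return path[path.index(v):]
            if v not in done:
                cycle = dfs(v, path)
                if cycle:
                    return cycle
        path.pop()
        done.add(u)
        return None

    for start in nodes:
        if start not in done:
            cycle = dfs(start, [])
            if cycle:
                return cycle
    return None


def break_cycles(edges, seed=0):
    """Remove edges until the graph is a DAG; return (succ, removed)."""
    rng = random.Random(seed)
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    removed = []
    while True:
        cycle = find_cycle(succ, rng)
        if cycle is None:
            return succ, removed
        # Degree = outdegree + indegree in the current graph (assumption).
        degree = {u: len(succ[u]) + sum(u in succ[w] for w in succ)
                  for u in cycle}
        top = max(degree.values())
        cycle_edges = list(zip(cycle, cycle[1:] + cycle[:1]))
        # Edges of C with at least one end in C_max.
        candidates = [(u, v) for u, v in cycle_edges
                      if degree[u] == top or degree[v] == top]
        u, v = rng.choice(candidates)
        succ[u].remove(v)
        removed.append((u, v))


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
dag, removed = break_cycles(edges, seed=1)
```

Here the only cycle is $a \to b \to c \to a$, so exactly one of its edges is removed.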

Step 2. In this step TripNet first applies HBUILD to $h$. If the
result is a tree, TripNet constructs a binarization of this tree.
Otherwise TripNet goes to Step 3.

Note that if $\tau$ is consistent with a tree, TripNet constructs a
binarization of $T_\tau$.

Step 3. Remove all the maximum-weight edges from G. The 
process of removing all the maximum-weight edges from the graph 
continues until the resulting graph is disconnected. 
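
Step 3 can be sketched in a few lines. The code below is a minimal illustration, assuming the weighted graph is stored as a dictionary from unordered pairs (`frozenset`) to weights; the function names are ours.

```python
def components(nodes, edges):
    """Connected components of an undirected graph with frozenset edges."""
    adj = {u: set() for u in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    comps, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_by_max_weight(nodes, weight):
    """Drop all maximum-weight edges repeatedly until the graph disconnects."""
    edges = dict(weight)
    while True:
        comps = components(nodes, edges)
        if len(comps) > 1 or not edges:
            return comps
        top = max(edges.values())
        edges = {e: w for e, w in edges.items() if w < top}


w = {frozenset("ij"): 1, frozenset("ik"): 2, frozenset("jk"): 2}
parts = split_by_max_weight(["i", "j", "k"], w)
```

With these weights, removing the two edges of weight 2 separates $k$ from $\{i, j\}$.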

In [3] and [4] the authors introduced the concept of SN-sets for
a set of triplets $\tau$. A subset $S$ of $L(\tau)$ is an SN-set if there is
no triplet $ij|k \in \tau$ such that $i \notin S$ and $j, k \in S$. In [4] it is
shown that if $\tau$ is dense then the maximal SN-sets partition
$L(\tau)$ and can be found in polynomial time. By contracting each
SN-set to a single node and assuming a common ancestor for all of
these leaves, the size of the problem is reduced. In these papers,
the authors rely on the high density of the input triplet set to
find the maximal SN-sets in polynomial time. The TripNet algorithm
instead uses the height function as an auxiliary tool to obtain
SN-sets, without the high density assumption.
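
The SN-set condition is easy to test directly. The following sketch assumes triplets are encoded as tuples `(i, j, k)` meaning $ij|k$; since $ij|k$ and $ji|k$ denote the same triplet, the condition is violated exactly when $k$ and one of $i$, $j$ lie in $S$ while the other does not. The function name is ours.

```python
def is_sn_set(subset, triplets):
    """Check the SN-set condition for `subset` against (i, j, k) triplets."""
    s = set(subset)
    for i, j, k in triplets:
        # Violation: the outgroup k and exactly one of i, j are in S.
        if k in s and ((i in s) != (j in s)):
            return False
    return True


triplets = [("i", "j", "k")]   # the single triplet ij|k
```

For this triplet set, $\{i, j\}$ and $\{i, j, k\}$ are SN-sets, while $\{j, k\}$ is not.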

Step 4. For each connected component obtained in Step 3
which is not an SN-set, we apply Step 3. This process continues
until all of the resulting components are SN-sets. Let
$\{S_1, S_2, \ldots, S_k\}$ be the set of resulting SN-sets. If each SN-set
contains only one node, HBUILD is applied and, if the result is a
tree, TripNet constructs a binary tree and goes to Step 6. Otherwise
TripNet goes to Step 5. If for some $i$, $|S_i| > 1$, contract each $S_i$
to a single node $s_i$ and set $S = \{s_1, s_2, \ldots, s_k\}$. Update the set
of triplets by defining
$\tau_S = \{s_i s_j | s_k : \exists\, xy|z \in \tau,\ x \in S_i,\ y \in S_j \text{ and } z \in S_k\}$.
Construct a weighted complete graph $(G_S, w_S)$ with $V(G_S) = S$ and
$w_S(s_i, s_j) = \min\{h(x,y) : x \in S_i \text{ and } y \in S_j\}$. Set
$(G, w) = (G_S, w_S)$ and go to Step 3.
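
The contraction in Step 4 can be illustrated as follows. This is a sketch under assumptions of ours: SN-sets are a list of Python sets, contracted nodes are represented by their list index, $h$ is a dictionary on unordered pairs of leaves, and a triplet is kept only when its three leaves fall into three different SN-sets.

```python
def contract(sn_sets, triplets, h):
    """Return (contracted triplets, contracted weights) for Step 4."""
    index = {x: a for a, block in enumerate(sn_sets) for x in block}

    new_triplets = set()
    for x, y, z in triplets:
        a, b, c = index[x], index[y], index[z]
        if len({a, b, c}) == 3:          # assumption: three distinct SN-sets
            new_triplets.add((min(a, b), max(a, b), c))

    new_w = {}
    for a in range(len(sn_sets)):
        for b in range(a + 1, len(sn_sets)):
            # w_S(s_a, s_b) = min h(x, y) over x in S_a, y in S_b
            new_w[frozenset((a, b))] = min(
                h[frozenset((x, y))] for x in sn_sets[a] for y in sn_sets[b])
    return new_triplets, new_w


sn_sets = [{"a", "b"}, {"c"}, {"d"}]
triplets = [("a", "b", "c"), ("a", "c", "d"), ("b", "c", "d")]
h = {frozenset("ab"): 1, frozenset("ac"): 2, frozenset("bc"): 3,
     frozenset("ad"): 4, frozenset("bd"): 4, frozenset("cd"): 4}
tau_s, w_s = contract(sn_sets, triplets, h)
```

In this example the triplets $ac|d$ and $bc|d$ both map to the single contracted triplet $s_0 s_1 | s_2$, and $w_S(s_0, s_1) = \min\{2, 3\} = 2$.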

The following theorem is a consequence of the definition of SN-set
for $(G_S, w_S)$.

Theorem 5. Applying Steps 3 and 4 to $(G_S, w_S)$ and $\tau_S$, each
resulting SN-set has one member.

Proof. Suppose that $S = \{s_1, s_2, s_3, \ldots, s_r\}$ is an SN-set in
$(G_S, w_S)$. Now assume that in the procedure of Step 3, by removing
the edges with weight $l$, $S_1$ separates from $S_2$. Thus there exists
$k > l$ such that by removing the edges with weight at least $k$ in
$(G_S, w_S)$, the connected component $S$ separates from the other
components of $G_S$. It means that by removing the edges with weight
at least $k$ in $G$, we obtain the SN-set
$S_1 \cup S_2 \cup \cdots \cup S_r$, which is a contradiction.

In the next step the reticulation leaves are recognized using the 
following three criteria: 

Criterion I. Let $m_i$ and $M_i$ be the minimum and maximum
weight of the edges in $(G, h)$ with exactly one end in $S_i$. Choose
the nodes with minimum $m_i$, and if there is more than one node with
minimum $m_i$ then choose among them the nodes which have
minimum $M_i$. Let $R_1$ denote the set of such nodes.

Criterion II. Let $w_{\min} = \min\{w(s_i, s_j) : 1 \le i,j \le k\}$. In $G_S$
consider the subgraph induced by the edges with weight $w_{\min}$.
Choose the nodes of $R_1$ with the maximum degree in this induced
subgraph. Let $R_2$ denote the set of such nodes.

Criterion III. For each node $s \in R_2$, remove it from $G_S$ and
find SN-sets for this new graph using Steps 3 and 4. Let $n_s$ be the
number of SN-sets of this new graph with cardinality greater than
one. Choose the nodes in $R_2$ with maximum $n_s$. Let $R_3$ denote the
set of such nodes.
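
Criteria I and II are simple selections over edge weights and can be sketched as below. Criterion III is omitted because it reruns Steps 3 and 4. The encoding is an assumption of ours: SN-sets are Python sets identified by their list index, and weights are dictionaries on unordered pairs.

```python
def criterion_one(sn_sets, h):
    """R_1: blocks with minimum m_i, ties broken by minimum M_i."""
    stats = []
    for a, block in enumerate(sn_sets):
        # Weights of edges with exactly one end in S_a.
        crossing = [w for e, w in h.items() if len(e & block) == 1]
        stats.append((min(crossing), max(crossing), a))
    best = min(stats)[:2]                # lexicographic: m_i first, then M_i
    return {a for m, big_m, a in stats if (m, big_m) == best}


def criterion_two(r1, w_s):
    """R_2: nodes of R_1 with maximum degree among minimum-weight edges."""
    w_min = min(w_s.values())
    degree = {a: sum(1 for e, w in w_s.items() if w == w_min and a in e)
              for a in r1}
    top = max(degree.values())
    return {a for a in r1 if degree[a] == top}


# Four singleton SN-sets; with singletons, w_S equals h up to relabeling.
blocks = [{"a"}, {"b"}, {"c"}, {"d"}]
h = {frozenset("ab"): 1, frozenset("ac"): 1, frozenset("ad"): 3,
     frozenset("bc"): 2, frozenset("bd"): 3, frozenset("cd"): 3}
w_s = {frozenset((0, 1)): 1, frozenset((0, 2)): 1, frozenset((0, 3)): 3,
       frozenset((1, 2)): 2, frozenset((1, 3)): 3, frozenset((2, 3)): 3}
r1 = criterion_one(blocks, h)
r2 = criterion_two(r1, w_s)
```

In this toy instance Criterion I leaves three candidates (indices 0, 1, 2) and Criterion II narrows them to index 0, which touches both minimum-weight edges.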

We state an example to show the idea behind these three 
criteria. 

Let $\tau = \{ij|l, jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k,
mo|i, jk|n, ij|o, ik|m, il|n\}$.

$\tau$ is not consistent with a tree but it is consistent with the
network $N$ shown in Figure 8a. Obviously, $N$ is an optimal
network consistent with $\tau$. In order to find SN-sets we construct
$G'_\tau$ and $(G, h)$, and find SN-sets from $(G, h)$ using Steps 3 and 4
(Figures 8b to 8g). It follows that $S = \{\{i\}, \{j\}, \{k\}, \{l\}, \{m\},
\{n, o\}\}$. Now in $G_S$ (Figure 8h) we expect that the reticulation
leaf is in $R_1$. In this example both $k$ and $l$ are in $R_1$. Also we
expect that if there is a reticulation leaf, it belongs to $R_2$, and
again both $k$ and $l$ are in $R_2$. Now just $l$ belongs to $R_3$. Thus we
consider $l$ as the reticulation leaf (Figures 8i to 8n). Remove the
triplets which contain $l$ from $\tau_S$ and denote the new set of
triplets by $\tau'_S$. Obviously $\tau'_S$ is consistent with a tree. We
add this reticulation leaf to a binarization of $T_{\tau'_S}$ such that
the resulting network is consistent with $\tau_S$. Note that if we
consider any node other than $l$ as the reticulation leaf then the
final network consistent with $\tau_S$ has at least two reticulation
leaves.

Step 5. In this step the reticulation leaf is recognized using the
three criteria. Apply Criterion I. If $|R_1| = 1$ then choose the node
$x \in R_1$ as the reticulation node. Otherwise, if $|R_1| > 1$, apply
Criterion II. If $|R_2| = 1$ then choose the node $x \in R_2$ as the
reticulation node. Otherwise, if $|R_2| > 1$, apply Criterion III. If
$|R_3| = 1$ then choose the node $x \in R_3$ as the reticulation node.
Otherwise, if $|R_3| > 1$, we choose the reticulation node according
to the speed option as follows.

Slow. Each node in $R_3$ is examined as the reticulation leaf.

Normal. Two nodes in $R_3$ are selected randomly and each of
these two nodes is examined as the reticulation leaf.

Fast. One node in $R_3$ is selected randomly as the reticulation
leaf.

Let $x$ be a node which is considered as a reticulation leaf.
Remove $x$ from $G_S$ and all of the triplets which contain $x$ from
$\tau_S$. Define $G = G \setminus \{x\}$ and go to Step 3.

Note that for the Fast option the running time of the algorithm 
is polynomial. 

For biological data, Criteria I and II almost always find a
unique reticulation leaf, so on real data the running time of
TripNet is almost always polynomial.

Step 6. Let $x_1, x_2, \ldots, x_m$ be the $m$ reticulation leaves
obtained in Step 5, in this order, and $T$ be the tree that is
constructed in Step 4. Now add these $m$ nodes to $T$ in the reverse
order as follows. Let $e_1$ and $e_2$ be two edges of $T$. Consider two
new nodes $y_1$ and $y_2$ in the middle of $e_1$ and $e_2$. Connect $y_1$
and $y_2$ to a new node $y_3$ and connect the reticulation leaf $x_m$ to
$y_3$. Do this procedure for all pairs of edges and choose a pair such
that the resulting network is consistent with the maximum number
of triplets in $\tau$. Continue this procedure until all the
reticulation nodes are added.

Step 7. For each SN-set $S_i$ and the set $\tau_{S_i}$ of triplets we run
the algorithm again.

Step 8. Replace each SN-set in the network of Step 6 with its 
related network constructed in Step 7 to obtain a network N' . 

Let $\tau' \subseteq \tau$ be the set of the triplets which are not
consistent with $N'$. For each pair of leaves $a$ and $b$, let $\tau'_{ab}$
be the set of triplets in $\tau'$ which are of the form $ab|c$. Consider
the pair of leaves $i$ and $j$ such that $\tau'_{ij}$ has the maximum
cardinality. Assume that $p_i$ and $p_j$ are the parents of $i$ and $j$,
respectively.
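
Selecting the pair $i$, $j$ is a frequency count over the inconsistent triplets. A minimal sketch, assuming triplets are tuples `(a, b, c)` meaning $ab|c$ and that ties are broken arbitrarily (the source does not specify a tie rule); the function name is ours:

```python
from collections import Counter


def most_violated_pair(inconsistent_triplets):
    """Return the pair {a, b} whose set of triplets ab|c is largest."""
    counts = Counter(frozenset((a, b)) for a, b, _ in inconsistent_triplets)
    pair, size = counts.most_common(1)[0]
    return set(pair), size


bad = [("a", "b", "c"), ("b", "a", "d"), ("c", "d", "a")]
pair, size = most_violated_pair(bad)
```

Here $ab|c$ and $ba|d$ share the pair $\{a, b\}$, so that pair is selected with two triplets.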

Step 9. Create two new nodes in the middle of the edges $p_i i$
and $p_j j$ and connect them with a new edge. This new edge creates